[Feature] implement async LoRA prefetch by glenliu21 · Pull Request #14190 · sgl-project/sglang

glenliu21 · 2025-12-01T05:30:28Z

Motivation

This PR addresses #8712. It uses the prefetch policy described in S-LoRA, in which LoRA adapters are prefetched based on the requests in the Scheduler's waiting queue.
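As a rough illustration of that policy, the sketch below walks a waiting queue in order and picks the adapters that upcoming requests need but that are not loaded yet. The `Request` class, `select_loras_to_prefetch`, and the `loaded` set are hypothetical names used only for this example; the only name taken from this PR is `max_loras_prefetch`, and the actual selection logic in the PR may differ.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass
class Request:
    """Hypothetical stand-in for a queued request."""
    rid: str
    lora_id: Optional[str]  # None means the request uses the base model


def select_loras_to_prefetch(
    waiting_queue: Iterable[Request],
    loaded: Set[str],
    max_loras_prefetch: int,
) -> List[str]:
    """Pick adapters needed by upcoming requests that are not loaded yet.

    Adapters are returned in queue order, without duplicates, and capped
    at max_loras_prefetch entries.
    """
    picked: List[str] = []
    for req in waiting_queue:
        if len(picked) >= max_loras_prefetch:
            break
        if req.lora_id is None or req.lora_id in loaded:
            continue  # nothing to load for this request
        if req.lora_id not in picked:
            picked.append(req.lora_id)
    return picked


queue = [
    Request("r1", "adapter3"),
    Request("r2", None),
    Request("r3", "adapter1"),
    Request("r4", "adapter3"),
    Request("r5", "adapter7"),
]
to_prefetch = select_loras_to_prefetch(queue, loaded={"adapter1"}, max_loras_prefetch=2)
```

Here `to_prefetch` ends up as `["adapter3", "adapter7"]`: `adapter1` is skipped because it is already loaded, and the second request for `adapter3` does not count twice against the cap.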

Modifications

Add max_loras_prefetch as a server argument
Implement creation of a ForwardBatch as a LoRA prefetch batch, consisting of the requests on the Scheduler's waiting queue that are next to be run
Implement LoRA prefetch support in LoRAManager, the memory pool, and the LoRA backend
Use ThreadPoolExecutor and a separate torch.cuda.Stream to enable async prefetching
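
The background-loading pattern in the last item can be sketched with the standard library alone. This is a minimal sketch, not the PR's implementation: `AsyncPrefetcher` and `load_adapter` are hypothetical names, and the weight copy that the PR issues on a separate torch.cuda.Stream is replaced here by a short sleep so the example runs without a GPU.

```python
import threading
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict


def load_adapter(name: str) -> str:
    """Hypothetical stand-in for copying adapter weights to the device."""
    time.sleep(0.01)  # simulate transfer latency
    return f"weights:{name}"


class AsyncPrefetcher:
    """Starts adapter loads on a worker thread and hands back the results."""

    def __init__(self, max_workers: int = 1) -> None:
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._lock = threading.Lock()
        self._futures: Dict[str, Future] = {}

    def prefetch(self, name: str) -> None:
        """Start loading in the background; no-op if already requested."""
        with self._lock:
            if name not in self._futures:
                self._futures[name] = self._pool.submit(load_adapter, name)

    def get(self, name: str) -> str:
        """Return the adapter, blocking only if its load has not finished."""
        self.prefetch(name)  # covers adapters that were never prefetched
        with self._lock:
            future = self._futures[name]
        return future.result()

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)


prefetcher = AsyncPrefetcher()
prefetcher.prefetch("adapter2")            # returns immediately
prefetched = prefetcher.get("adapter2")    # waits only for the remaining load time
on_demand = prefetcher.get("adapter5")     # not prefetched, so this pays the full load time
prefetcher.shutdown()
```

The point of the pattern is that `prefetch` returns right away, so the caller can keep running the current batch while the next adapters load; a later `get` only blocks for whatever part of the load is still outstanding.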

Accuracy Tests

Added a basic end-to-end test to ensure that enabling LoRA prefetching doesn't change expected outputs.

Benchmarking and Profiling

@ConnorLi96 ran the following commands to benchmark LoRA prefetching:

for i in {1..16}; do curl -s -X POST http://0.0.0.0:30001/load_lora_adapter -H 'Content-Type: application/json' -d "{\"lora_name\": \"adapter${i}\", \"lora_path\": \"/workspace/adapters/llama_3_1_8B_adapter\"}"; echo " ✓ adapter${i}"; done
python3 -m sglang.bench_serving --backend sglang --base-url http://localhost:30001/ --dataset-name random --num-prompts 100 --request-rate 4 --random-input-len 2048 --random-output-len 1024 --disable-ignore-eos --disable-tqdm --lora-name adapter1 adapter2 adapter3 adapter4 adapter5 adapter6 adapter7 adapter8 adapter9 adapter10 adapter11 adapter12 adapter13 adapter14 adapter15 adapter16

This yielded the following results:

Before

----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   22579.58  
Median E2E Latency (ms):                 22261.88  
---------------Time to First Token----------------
Mean TTFT (ms):                          16157.50  
Median TTFT (ms):                        15918.48  
P99 TTFT (ms):                           34927.59

After

----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   17620.85
Median E2E Latency (ms):                 16273.82
---------------Time to First Token----------------
Mean TTFT (ms):                          11926.84
Median TTFT (ms):                        10865.44
P99 TTFT (ms):                           26765.88

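These show about a 31.7% decrease in median TTFT and a 26.9% decrease in median E2E latency. The mean values drop by about 26.2% (TTFT) and 22.0% (E2E latency), and P99 TTFT drops by about 23.4%.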
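
As a check on the arithmetic, the relative decreases can be recomputed from the Before and After figures above. The helper name `pct_decrease` is only for this example.

```python
def pct_decrease(before: float, after: float) -> float:
    """Relative decrease from before to after, in percent."""
    return (before - after) / before * 100.0


# (before, after) pairs in milliseconds, copied from the results above
results = {
    "Mean E2E Latency": (22579.58, 17620.85),
    "Median E2E Latency": (22261.88, 16273.82),
    "Mean TTFT": (16157.50, 11926.84),
    "Median TTFT": (15918.48, 10865.44),
    "P99 TTFT": (34927.59, 26765.88),
}

decreases = {name: pct_decrease(b, a) for name, (b, a) in results.items()}
for name, value in decreases.items():
    print(f"{name}: {value:.1f}% lower")
```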

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.
Work with maintainers to merge your PR. See the PR Merge Process

glenliu21 wants to merge 5 commits into sgl-project:main from glenliu21:lora_prefetch


3 participants
